Abstract

Coordinating multiple agents that need to perform a sequence
of actions to maximize a system-level reward requires solving
two distinct credit assignment problems. First, credit must be
assigned for an action taken at time step t that results in a
reward at time step t' > t. Second, credit must be assigned
for the contribution of agent i to the overall system
performance. The first credit assignment problem is typically
addressed with temporal difference methods such as Q-learning.
The second credit assignment problem is typically addressed
by creating custom reward functions. To address both credit
assignment problems simultaneously, we propose “Q Updates
with Immediate Counterfactual Rewards learning”
(QUICR-learning), designed to improve both the convergence
properties and performance of Q-learning in large multi-agent
problems. QUICR-learning is based on previous work on
single-time-step counterfactual rewards described by the
collectives framework. Results on a traffic congestion problem
show that QUICR-learning is significantly better than a
Q-learner using collectives-based (single-time-step
counterfactual) rewards. In addition, QUICR-learning provides
significant gains over conventional and local Q-learning.
Additional results on a multi-agent grid world problem show
that the improvements due to QUICR-learning are not domain
specific and can provide up to a tenfold increase in
performance over existing methods.

Introduction 

Coordinating a set of interacting agents that take sequences
of actions to maximize a system-level performance criterion
is a difficult problem. Addressing this problem with a large
single-agent reinforcement learning algorithm is ineffective
in general because the state space becomes prohibitively
large. A more promising approach is to give each agent in
the multi-agent system its own reinforcement learner. This
approach, however, introduces a new problem: how to assign
credit for the contribution of an agent to the system
performance, which in general is a function of all agents.
Allowing each agent to try to maximize the system-level
global reward is problematic in all but the smallest problems,
as an agent’s reward is masked by the actions of all the
other agents in the system. In the Markov Decision Problems
(MDPs) presented in this paper, the global reward may be
influenced by as many as 800 actions (actions of 40 agents
over 20 time steps). Purely local rewards allow us to overcome
this “signal-to-noise” problem. On the other hand, local
rewards are problematic, since there are no guarantees
that policies formed by agents that maximize their local
reward will also maximize the global reward.

In this paper, we present “Q Updates with Immediate
Counterfactual Rewards learning” (QUICR-learning), which
uses agent-specific rewards that ensure fast convergence in
multi-agent coordination domains. Rewards in QUICR-learning
are both heavily agent-sensitive, making the learning
task easier, and aligned with the system-level goal,
ensuring that agents receiving high rewards are helping the
system as a whole. QUICR-learning uses standard temporal
difference methods but, because of its unique reward
structure, provides significantly faster convergence than
standard Q-learning in large multi-agent systems. In the next
section, we present a brief summary of the related research. We
then discuss the temporal and structural credit assignment
problems in multi-agent systems, and describe the
QUICR-learning algorithm. The following two sections present
results on two different problems that require coordination,
showing that QUICR-learning performs up to ten times better
than standard Q-learning in multi-agent coordination
problems. Finally, we discuss the implications and limitations
of QUICR-learning and highlight future research directions.

Previous Work 

Currently the best multi-agent learning algorithms used in
coordinating agents address the structural credit assignment
problem by leveraging domain knowledge. In robotic soccer,
for example, player-specific subtasks are used, followed by
tiling to provide good convergence properties (Stone, Sutton,
& Kuhlmann 2005). In a robot coordination problem for
the foraging domain, specific rules induce good division of
labor (Jones & Mataric 2003). In domains where groups of
agents can be assumed to be independent, the task can be
decomposed by learning a set of basis functions used to
represent the value function, where each basis only processes
a small number of the state variables (Guestrin, Lagoudakis,
& Parr 2002). Also, multi-agent Partially Observable Markov
Decision Processes (POMDPs) can be simplified through
piecewise linear rewards (Nair et al. 2003). There have also
been several approaches to optimizing Q-learning in
multi-agent systems that do not use independence assumptions.
For a small number of agents, game theoretic techniques
were shown to lead to a multi-agent Q-learning algorithm
proven to converge (Hu & Wellman 1998). In addition, the
equivalence between structural and temporal credit assignment
was shown in (Agogino & Turner 2004), and Bayesian
methods were shown to improve multi-agent learning by
providing better exploration capabilities
(Chalkiadakis & Boutilier 2003). Finally, task decomposition
in single-agent RL can be achieved using hierarchical
reinforcement learning methods such as MAXQ value function
decomposition (Dietterich 2000).

Credit Assignment Problem 

The multi-agent temporal credit assignment problem con- 
sists of determining how to assign rewards (e.g., credit) for 
a sequence of actions. Starting from the current time step t, 
the undiscounted sum of rewards till a final time step T can 
be represented by: 

$$R_t(s_t(a)) = \sum_{k=0}^{T-t} r_{t+k}(s_{t+k}(a)) \tag{1}$$

where $a$ is a vector containing the actions of all agents at all
time steps, $s_t(a)$ is the state function returning the state of
all agents for a single time step, and $r_t(s)$ is the
single-time-step reward function, which is a function of the
states of all of the agents.

This reward is a function of all of the previous actions of
all of the agents. Every reward is a function of the states of
all the agents, and every state is a function of all the actions
that preceded it (even though it is Markovian, the previous
states ultimately depend on previous actions). In a system
with $n$ agents, a reward received on the last time step can be
affected by up to $n T$ actions. Looking at rewards received
at different time steps, on average $\frac{1}{2} n T$ actions may
affect a reward in tightly coupled systems. Agents need to use
this reward to evaluate their single action; in the domains
presented in the results sections with forty agents and twenty
time steps there are up to 800 actions affecting the reward.

Standard Q-Learning 

Reinforcement learners such as Q-learning address (though
imperfectly) how to assign credit of future rewards to an
agent’s current action. The goal of Q-learning is to create a
policy that maximizes the sum of future rewards, $R_t(s_t(a))$,
from the current state (Kaelbling, Littman, & Moore 1996;
Sutton & Barto 1998; Watkins & Dayan 1992). It does this
by maintaining tables of Q-values, which estimate the expected
sum of future rewards for a particular action in a particular
state. In the TD(0) version of Q-learning, a Q-value,
$Q(s_t, a_t)$, is updated with the following Q-learning rule¹:

$$Q(s_t, a_t) = r_t + \max_a Q(s_{t+1}, a). \tag{2}$$

¹ To simplify notation, this paper uses Q-learning update
notation for deterministic (where learning rate $\alpha = 1$ converges),
undiscounted Q-learning. The extensions to non-deterministic and
discounted cases through the addition of learning rate and
discounting parameters are straightforward.


This update assumes that the action $a_t$ is most responsible
for the immediate reward $r_t$, and is less responsible for the
sum of future rewards, $\sum_{k=1}^{T-t} r_{t+k}(s_{t+k}(a))$. This
assumption is reasonable since rewards in the future are affected
by uncertain future actions and noise in state transitions.
Instead of using the sum of future rewards directly to update
its table, Q-learning uses a Q-value from the next state
entered as an estimate for those future rewards. Under certain
assumptions, Q-values are shown to converge to the actual
value of the future rewards (Watkins & Dayan 1992).
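
As a minimal sketch of the update in Equation 2, the Python below applies deterministic, undiscounted backups to a hypothetical three-state chain. The chain, the single action "go", the reward values, and the helper name `q_backup` are illustrative assumptions and are not part of the experiments in this paper.

```python
from collections import defaultdict

def q_backup(Q, s, a, r, s_next, actions, terminal=False):
    """Deterministic, undiscounted backup: Q(s,a) = r + max_a' Q(s',a')."""
    # No future reward is available once the episode has ended.
    future = 0.0 if terminal else max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = r + future
    return Q[(s, a)]

# Hypothetical chain 0 -> 1 -> 2 with a single action "go";
# only the final transition is rewarded.
actions = ["go"]
Q = defaultdict(float)          # table entries start at zero
episode = [(0, "go", 0.0, 1, False), (1, "go", 1.0, 2, True)]

# Two sweeps are enough for the reward to propagate back to state 0.
for _ in range(2):
    for s, a, r, s_next, done in episode:
        q_backup(Q, s, a, r, s_next, actions, terminal=done)

print(Q[(0, "go")], Q[(1, "go")])
```

After the first sweep only the rewarded transition has a non-zero value; the second sweep carries that value back one step, which is the temporal credit assignment the text describes.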

Even though Q-learning addresses the temporal credit
assignment problem (i.e., tries to apportion the effects of all
actions taken at other time steps to the current reward), it
does not address the structural credit assignment problem
(i.e., how to apportion credit to the individual agents in the
system). As a result, when many agents need to coordinate
their actions, standard Q-learning is generally slow since it
needs all agents to tease out their time-dependent contribution
to the global system performance based on the global
reward they receive. An additional issue with standard
Q-learning is that in general an agent needs to fully observe
the actions of other agents in order to compute its reward.
This requirement is reduced in the “Local Q-Learning” and
“QUICR-Learning” algorithms presented in this paper.

Local Q-Learning 

One way to address the structural credit assignment problem 
and allow for fast learning is to assume that agents’ actions 
are independent. Without this assumption, the immediate 
reward function for a multi-agent reward system may be a 
function of all the states: 

$$r_t(s_{t,1}(a_1), s_{t,2}(a_2), \ldots, s_{t,n}(a_n)),$$

where $s_{t,i}(a_i)$ is the state for agent $i$ and is a function of
only agent $i$’s previous actions. The number of states
determining the reward grows linearly with the number of agents,
while the number of actions that determine each state grows
linearly with the number of time steps. To reduce the huge
number of actions that affect this reward, often the reward is
assumed to be linearly separable:

$$r_t(s_t) = \sum_i w_i \, r_{t,i}(s_{t,i}(a_i)).$$

Then each agent receives a reward $r_{t,i}$ which is only a
function of its own actions. Q-learning is then used to resolve
the remaining temporal credit assignment problem. If the agents
are indeed independent and pursuing their local objectives
has no deleterious side effects on each other, this
method leads to a significant speedup in learning rates, as
an agent receives direct credit for its actions. However, if
the agents are coupled, then though local Q-learning will
allow fast convergence, the agents will tend to converge to the
wrong policies (i.e., policies that are not globally desirable).
In the worst case of strong agent coupling, this can lead to
worse than random performance (Wolpert & Turner 2001).
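
To make the linear-separability assumption concrete, the sketch below builds a global reward as a weighted sum of per-agent local rewards and checks that changing one agent's state affects the total only through that agent's own term. The local reward function, the weights, and the state values are hypothetical choices for illustration.

```python
def local_reward(state_i):
    # Hypothetical local reward: depends only on agent i's own state.
    return float(state_i) ** 0.5

def separable_global_reward(states, weights):
    """r_t(s_t) = sum_i w_i * r_{t,i}(s_{t,i}) under the separability assumption."""
    return sum(w * local_reward(s) for w, s in zip(weights, states))

states = [4, 9, 16]
weights = [1.0, 0.5, 0.25]
r_global = separable_global_reward(states, weights)

# Change only agent 1's state: the global reward moves by exactly
# agent 1's weighted local term, and no other agent's term changes.
changed = [4, 0, 16]
delta = separable_global_reward(changed, weights) - r_global
expected_delta = weights[1] * (local_reward(0) - local_reward(9))
```

When the true system reward has this form, each agent can learn from its own term alone; the coupled rewards used later in the paper do not decompose this way.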

QUICR-Learning 

In this section we present QUICR-learning, a learning
algorithm for multi-agent systems that does not assume that
the system reward function is linearly separable. Instead it
uses a mechanism for creating rewards that are a function
of all of the agents, but still provide many of the benefits
of hand-crafted rewards. In particular, QUICR-learning
rewards have:

1. high “alignment” with the overall learning task.

2. high “sensitivity” to the actions of the agent.

The first property, alignment, means that when an agent
maximizes its own reward it tends to maximize the overall
system reward. Without this property, a large multi-agent
system can lead to agents performing useless work or,
worse, working at cross-purposes. Having aligned rewards
is critical to multi-agent coordination. Reward sensitivity
means that an agent’s reward is more sensitive to its own
actions than to other agents’ actions. This property is
important for agents to learn quickly. Note that assigning the
full system reward to all the agents (e.g., standard Q-learning)
has low agent-sensitivity, since each agent’s reward depends
on the actions of all the other agents.

QUICR-learning is based on providing agents with rewards
that are both aligned with the system goals and sensitive
to the agent’s states. It aims to provide the benefits
of customizing rewards without requiring detailed domain
knowledge. In a task where the reward can be expressed
as in Equation 1, let us introduce the difference reward
(adapted from (Wolpert & Turner 2001)) given by:

$$D^i_t(s_t(a)) = R_t(s_t(a)) - R_t(s_t(a - a_{t,i})),$$

where $a - a_{t,i}$ denotes a counterfactual state where agent $i$
has not taken the action it took in time step $t$ (i.e., the action
of agent $i$ has been removed from the vector containing the
actions of all the agents before the system state has been
computed). Decomposing further, we obtain:

$$\begin{aligned}
D^i_t(s_t(a)) &= \sum_{k=0}^{T-t} \left[ r_{t+k}(s_{t+k}(a)) - r_{t+k}(s_{t+k}(a - a_{t,i})) \right] \\
&= \sum_{k=0}^{T-t} d_{t+k}\left(s_{t+k}(a),\, s_{t+k}(a - a_{t,i})\right),
\end{aligned} \tag{3}$$

where $d_t(s_1, s_2) = r_t(s_1) - r_t(s_2)$. (We introduce the single
time step “difference” reward $d_t$ to keep the parallel between
Equations 1 and 3.) This reward is more sensitive to an
agent’s action than $r_t$ since much of the effect of the other
agents is subtracted out with the counterfactual (Wolpert
& Turner 2001). In addition, often an agent does not need
to fully observe the actions of other agents to compute the
difference reward, since in many domains the subtraction
cancels out many of the variables. Unfortunately, in general
$d_t(s_1, s_2)$ is non-Markovian since the second parameter may
depend on previous states, making its use troublesome in a
learning task involving both temporal and structural credit
assignment. (This difficulty is examined further below.)

In order to overcome this shortcoming of Equation 3, let
us make the following assumptions:

1. The counterfactual action $a - a_{t,i}$ moves agent $i$ to an
absorbing state, $s_b$.

2. $s_b$ is independent of the agent’s current (or previous)
state(s).

These assumptions are necessary to have the Q-table backups
approximate the full time-extended difference reward
given in Equation 3. Forcing the counterfactual action
$a - a_{t,i}$ to move the agent into an absorbing state is
necessary to enable the computation of Equation 3. Without this
assumption the ramifications of $a - a_{t,i}$ would have to be
forward propagated through time to compute $D^i_t$. The second
assumption is necessary to satisfy the Markov property
of the system in a subtle way. While the next state $s_{t+1}$
caused by action $a_t$ is generally a function of the current
state, the absorbing state $s_b$ caused by the action $a - a_{t,i}$
should be independent of the current state in order for the
Q-table updates to propagate correctly. The problem here is
that Q-table backups are based on the next state entered, not
the counterfactual state entered. The experimental results
sections later in this paper show examples of the problems
caused when this assumption is broken. In addition to these
assumptions, the state of an agent should not be a direct
function of the actions of other agents; otherwise we would
have to compute the effects of counterfactual action $a - a_{t,i}$
on all the agents. However, this does not mean that the
agents in the system are independent. They still strongly
influence each other through the system reward, which in
general is nonlinear.

Given these conditions, the counterfactual state for time
$t + k$ is computed from the actual state at time $t + k$, by
replacing the state of agent $i$ at time $t$ with $s_b$. Now the
difference reward can be made into a Markovian function:

$$d^i_t(s_t) = r_t(s_t) - r_t(s_t - s_{t,i} + s_b), \tag{4}$$

where the expression $s_t - s_{t,i} + s_b$ denotes replacing agent
$i$’s state with state $s_b$.
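
The Markovian difference reward of Equation 4 can be sketched in a few lines of Python. The reward function below (a congestion-style sum over road occupancies) and the encoding of the absorbing state as `None` (an agent that occupies no road) are assumptions made only for this illustration.

```python
import math

ABSORBING = None  # assumed encoding of the absorbing state s_b

def system_reward(states, capacity=2.0):
    """Toy nonlinear system reward over road occupancies (assumed form)."""
    counts = {}
    for s in states:
        if s is not ABSORBING:          # agents in s_b occupy no road
            counts[s] = counts.get(s, 0) + 1
    return sum(k * math.exp(-k / capacity) for k in counts.values())

def difference_reward(states, i, reward=system_reward):
    """d_t^i(s_t) = r_t(s_t) - r_t(s_t - s_{t,i} + s_b)."""
    counterfactual = list(states)
    counterfactual[i] = ABSORBING       # replace agent i's state with s_b
    return reward(states) - reward(counterfactual)

states = ["A", "A", "A", "B"]              # three agents on road A, one on B
d_crowded = difference_reward(states, 0)   # an agent on the crowded road
d_alone = difference_reward(states, 3)     # the agent alone on road B
```

In this toy setting the agent on the overcrowded road receives a negative difference reward while the agent alone on road B receives a positive one, even though both see the same system reward.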

Now the Q-learning rule can be applied to the difference
reward, resulting in the QUICR-learning rule:

$$\begin{aligned}
Q_{UICR}(s_t, a_t) &= r_t(s_t) - r_t(s_t - s_{t,i} + s_b) + \max_a Q(s_{t+1}, a) \\
&= d^i_t(s_t) + \max_a Q(s_{t+1}, a).
\end{aligned} \tag{5}$$
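
A single QUICR-learning backup for one agent might look like the following sketch. The function name `quicr_update`, the state and action labels, and the numeric rewards are hypothetical; the optional learning rate `alpha` is included because the experiments use one, and `alpha = 1` gives the deterministic form of Equation 5.

```python
from collections import defaultdict

def quicr_update(Q, s, a, r_actual, r_counterfactual, s_next, actions, alpha=1.0):
    """One QUICR backup for a single agent.

    r_actual         : r_t(s_t), system reward for the joint state
    r_counterfactual : r_t(s_t - s_{t,i} + s_b), reward with this agent in s_b
    alpha            : learning rate; alpha = 1 reproduces Equation 5
    """
    d = r_actual - r_counterfactual                      # difference reward
    target = d + max(Q[(s_next, b)] for b in actions)    # undiscounted target
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

actions = ["stay", "left", "right"]
Q = defaultdict(float)
Q[("road2", "stay")] = 0.4    # pretend a downstream value was already learned

new_value = quicr_update(Q, "road1", "right", r_actual=1.0,
                         r_counterfactual=0.7, s_next="road2", actions=actions)
```

The only change from a standard Q-learning backup is the reward term, which is why the rule inherits Q-learning's machinery unchanged.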

Note that since this learning rule is Q-learning, albeit applied
to a different reward structure, it shares all the convergence
properties of Q-learning. In order to show that Equation 5
leads to good system-level behavior, we need to show
that agent $i$ maximizing $d^i_t(s_t)$ (i.e., following Equation 5)
will maximize the system reward $r_t$. Note that by definition
$(s_t - s_{t,i} + s_b)$ is independent of the actions of agent $i$, since
it is formed by moving agent $i$ to the absorbing state $s_b$ from
which it cannot emerge. This effectively means the partial
derivative of $d^i_t(s_t)$ with respect to agent $i$ is²:

$$\begin{aligned}
\frac{\partial}{\partial s_i} d^i_t(s_t) &= \frac{\partial}{\partial s_i} \left( r_t(s_t) - r_t(s_t - s_{t,i} + s_b) \right) \\
&= \frac{\partial}{\partial s_i} r_t(s_t) - \frac{\partial}{\partial s_i} r_t(s_t - s_{t,i} + s_b) \\
&= \frac{\partial}{\partial s_i} r_t(s_t) - 0 \\
&= \frac{\partial}{\partial s_i} r_t(s_t).
\end{aligned} \tag{6}$$

² Though in this work we show this result for differentiable
states, the principle applies to more general states, including
discrete states.

Therefore any agent $i$ using a learning algorithm to optimize
$d^i_t(s_t)$ will tend to optimize $r_t(s_t)$.
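
For discrete states, the analogue of this derivative argument is that changing agent $i$'s state changes $d^i_t$ and $r_t$ by the same amount, because the counterfactual term does not depend on agent $i$'s state. The sketch below checks this numerically for a hypothetical coupled reward; the reward form and the state encoding are assumptions, and any other function of the joint state could be substituted.

```python
import math
from itertools import product

S_B = None   # assumed encoding of the absorbing state s_b

def reward(states, capacity=2.0):
    # Hypothetical coupled reward over road occupancies.
    roads = [s for s in states if s is not S_B]
    return sum(roads.count(j) * math.exp(-roads.count(j) / capacity)
               for j in set(roads))

def d_i(states, i):
    """Difference reward with agent i moved to the absorbing state."""
    cf = list(states)
    cf[i] = S_B
    return reward(states) - reward(cf)

others = (0, 0, 1)                  # fixed states of the other agents
max_gap = 0.0
for x, y in product(range(3), repeat=2):
    sx, sy = (x,) + others, (y,) + others
    # Change in d^i versus change in r when agent 0 moves from y to x.
    gap = (d_i(sx, 0) - d_i(sy, 0)) - (reward(sx) - reward(sy))
    max_gap = max(max_gap, abs(gap))
```

The gap is zero up to floating-point rounding for every pair of states, so any state that improves agent 0's difference reward improves the system reward by the same amount.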

QUICR-Learning and WLU 

The difference reward in QUICR-learning, $d^i_t(s_t) =
r_t(s_t) - r_t(s_t - s_{t,i} + s_b)$, is closely related to the
“Wonderful Life Utility” used in multiple non-time-extended
problems (Wolpert & Turner 2001):

$$WLU^i(s) = r(s) - r(s - s_i + c), \tag{7}$$

where $c$ is independent of state $s_i$. The straightforward
conversion of this into single-time-step rewards is:

$$WLU^i_t(s_t) = r(s_t) - r(s_t - s_{t,i} + c_t), \tag{8}$$

where $c_t$ is independent of state $s_t$. The reward $d^i_t(s_t)$ is
a form of $WLU^i_t(s_t)$ that places greater restrictions on $c_t$:
it must be independent of all previous states and should be
an absorbing state. Without these restrictions $WLU^i_t$
creates problems with reward alignment and sensitivity. Even
without the restrictions, $WLU^i_t$ is aligned with the system
reward for single-time-step problems since $c_t$ is independent
of the agent’s current state. The subtle difficulty is that
values of $WLU^i_t$ get propagated back to previous states through
Q-learning. If $c_t$ is not independent of all previous states,
values that are not aligned with the system reward may be
propagated back to previous states. While these differences
sometimes do not matter, experimental results presented
later in this paper show that they are often important.
Having $c_t$ be an absorbing state (as done in QUICR-learning) is
needed to keep the learners’ Q-values approximating the
time-extended difference reward $D^i_t(s_t(a))$.
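
The distinction between the two counterfactuals can be illustrated with a short sketch. Here $c_t$ for the WLU variant is taken to be the agent's previous state, as in the $WLU_t$ Q-learning variant used in the experiments; the state labels and function names are hypothetical.

```python
S_B = "absorbing"   # assumed label for the fixed absorbing state s_b

def quicr_counterfactual(states, i):
    """s_t - s_{t,i} + s_b: the replacement never depends on history."""
    cf = list(states)
    cf[i] = S_B
    return cf

def wlu_counterfactual(states, prev_states, i):
    """s_t - s_{t,i} + c_t, with c_t taken from agent i's previous state."""
    cf = list(states)
    cf[i] = prev_states[i]
    return cf

now = ["r2", "r5"]
# Two different histories that bring agent 0 to the same current state.
history_a = ["r1", "r5"]
history_b = ["r3", "r5"]

quicr_cf = quicr_counterfactual(now, 0)
wlu_a = wlu_counterfactual(now, history_a, 0)
wlu_b = wlu_counterfactual(now, history_b, 0)
```

The QUICR counterfactual is the same joint state whichever history led to `now`, whereas the WLU counterfactuals differ, so a reward computed from them depends on a previous state that the Q-table backup does not see.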

Traffic Congestion Experiment 

To evaluate the performance of QUICR-learning, we perform
experiments that test the ability of agents to maximize a
reward based on an abstract traffic simulation. In this
experiment $n$ drivers can take a combination of $m$ roads to make
it to their destination. Each road $j$ has an ideal capacity $c_j$
representing the size of the road. In addition, each road has a
weighting value $w_j$ representing a driver’s benefit from
taking the road. This weighting value can be used to represent
properties such as a road’s difficulty to drive on and
convenience to destination. In this experiment a driver starts
on a road chosen randomly. At every time step, the driver
can choose to stay on the same road or to transfer to one of
two adjacent roads. In order to test the ability of learners to
perform long-term planning, the global reward is zero for all
time steps, except for the last time step when it is computed
as follows:

$$r_t = \sum_j k_{j,t} \, e^{-k_{j,t}/c_j}, \tag{9}$$

where $k_{j,t}$ is the number of drivers on road $j$ at time $t$.
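
A congestion reward driven by the road occupancies $k_{j,t}$ can be sketched as below. The per-road term `k * exp(-k / c_j)`, which peaks when occupancy equals the ideal capacity, is an assumption adopted for this illustration, as are the capacities and the driver assignments.

```python
import math
from collections import Counter

def traffic_reward(roads_taken, capacity):
    """Global reward computed from road occupancies k_{j,t}.

    ASSUMPTION: each road j contributes k * exp(-k / c_j).
    """
    k = Counter(roads_taken)                 # k[j] = number of drivers on road j
    return sum(n * math.exp(-n / capacity[j]) for j, n in k.items())

capacity = {0: 2.0, 1: 2.0, 2: 2.0}          # hypothetical ideal capacities c_j
spread = [0, 0, 1, 1, 2, 2]                  # two drivers on each road
crowded = [0, 0, 0, 0, 0, 0]                 # all six drivers on road 0

r_spread = traffic_reward(spread, capacity)
r_crowded = traffic_reward(crowded, capacity)
```

Under this assumed form, spreading drivers so that each road sits at its capacity scores far higher than piling every driver onto one road, which is the coordination pressure the experiment is designed to create.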


Learning Algorithms 

In our experiments for both this traffic congestion problem
and the grid world problem (presented in the next section)
we tested the multi-agent system using variants of the
temporal difference method with $\lambda = 0$ (TD(0)). The actions of
the agents were chosen using an epsilon-greedy exploration
scheme and tables were initially set to zero with ties broken
randomly (in the traffic congestion experiment $\epsilon$ was set to
0.05 and in the multi-agent grid world experiment $\epsilon$ was set
to 0.15). In the traffic congestion experiment, there were 60
agents taking actions for six consecutive time steps. The
learning rate was set to 0.5 (however, to simplify notation we
do not show the learning rates in the update equations). The
four algorithms are as follows:

• Standard Q-learning is based on the full reward $r_t$:

$$Q(s_t, a_t) = r_t(s_t) + \max_a Q(s_{t+1}, a). \tag{10}$$

• Local Q-learning is only a function of the specific driver’s
own road, $j$:

$$Q_{loc}(s_t, a_t) = k_{j,t} \, e^{-k_{j,t}/c_j} + \max_a Q_{loc}(s_{t+1}, a).$$

• QUICR-learning instead updates with a reward that is a
function of all of the states, but uses counterfactuals to
suppress the effect of other drivers’ actions:

$$Q_{UICR}(s_t, a_t) = r_t(s_t) - r_t(s_t - s_{t,i} + s_b) + \max_a Q_{UICR}(s_{t+1}, a),$$

where $s_t - s_{t,i} + s_b$ is the state resulting from removing
agent $i$’s state and replacing it with the absorbing state $s_b$.

• $WLU_t$ Q-learning is similar to QUICR-learning, but uses
a simpler form of counterfactual state. Instead of replacing
the state $s_{t,i}$ by the absorbing state $s_b$, it is replaced
by the state that the driver would have been in if he had
taken action 0, which in this case is the same road he was
on at the previous time step: $s_{t-1,i}$. The resulting update
equation is:

$$Q_{WLU}(s_t, a_t) = r_t(s_t) - r_t(s_t - s_{t,i} + s_{t-1,i}) + \max_a Q_{WLU}(s_{t+1}, a).$$
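
All four learners share the same exploration scheme and learning rate, differing only in the reward they back up. The sketch below shows epsilon-greedy selection with random tie-breaking and an undiscounted TD(0) step with learning rate 0.5; the function names, the action labels, and the fixed random seed are illustrative assumptions.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else a greedy one.

    Ties among maximal Q-values are broken uniformly at random.
    """
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(q_values.values())
    return rng.choice([a for a in actions if q_values[a] == best])

def td0_update(q_sa, reward, max_q_next, alpha=0.5):
    """Undiscounted TD(0) step with a learning rate."""
    return q_sa + alpha * (reward + max_q_next - q_sa)

rng = random.Random(0)                              # fixed seed for repeatability
q_row = {"stay": 0.0, "left": 0.0, "right": 0.0}    # tables start at zero
picks = {epsilon_greedy(q_row, 0.05, rng) for _ in range(200)}
updated = td0_update(0.0, reward=1.0, max_q_next=0.0)
```

Because the table starts at zero, every action is initially tied, so random tie-breaking is what drives early exploration even with a small epsilon.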

Results 

Experimental results on the traffic congestion problem show
that QUICR-learning learns more quickly and achieves a
higher level of performance than the other learning methods
(Figure 1). While standard Q-learning is able to improve
performance with time, it learns very slowly. This
slow learning speed is caused by Q-learning’s use of the full
reward $r_t(s_t)$, which is a function of the actions of all the
other drivers. When a driver takes an action that is beneficial,
the driver may still receive a poor reward if some of the
fifty-nine other drivers took poor actions at the same time.
In contrast, local Q-learning learns quickly, but since it uses
a reward that is not aligned with the system reward, drivers
using local Q-learning eventually learn to take bad actions.
Early in learning, drivers using local Q-learning perform well
as they learn to use the roads with high capacity and higher
weighting. However, as learning progresses the drivers start
overusing the roads with high weighting, since their reward
does not take into account that using other roads would
benefit the system as a whole. This system creates a classic
Tragedy of the Commons scenario. By overutilizing the
“beneficial” roads, the drivers end up being worse off than
if they had acted in a cooperative manner. Drivers using
$WLU_t$ Q-learning have similar problems because although
they are aligned at each time step, they are not aligned across
time steps.



Figure 1: Traffic Congestion Problem (60 Agents). 


Multi-agent Grid World Experiment 

The second set of experiments we conducted involved a 
standard grid world problem (Sutton & Barto 1998). In this 
problem, at each time step, the agent can move up, down, 
right or left one grid square, and receives a reward (possi- 
bly zero) after each move. The observable state space for 
the agent is its grid coordinate and the reward it receives de- 
pends on the grid square to which it moves. In the episodic 
version, which is the focus of this paper, the agent moves 
for a fixed number of time steps, and then is returned to its 
starting location. 

In the multi-agent version of the problem there are multiple
agents navigating the grid simultaneously, influencing
each other’s rewards. In this problem agents are rewarded
for observing tokens located in the grid. Each token has a
value between zero and one, and each grid square can have
at most one token. When an agent moves into a grid square
containing a token, it observes the token and receives a reward
equal to the value of the token. Rewards are only received on
the first observation of the token. Future observations by the
agent or other agents do not receive rewards in the same
episode. More precisely, $r_t$ is computed by:

$$r_t = \sum_i \sum_j V_j \, I^t_{s_{t,i} = L_j}, \tag{11}$$

where $V_j$ is the value of token $j$ and $I^t$ is the indicator
function which returns one when an agent in state $s_{t,i}$ is in
the location of an unobserved token $L_j$. The global objective
of the multi-agent grid world problem is to observe the highest
aggregated value of tokens in a fixed number of time steps $T$.
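
The token-observation rule described above can be sketched as follows. The token layout, the function name `token_reward`, and the choice to credit a token once when two agents reach it on the same step are assumptions for illustration only.

```python
def token_reward(agent_positions, tokens, observed):
    """Sum the values of tokens observed for the first time this step.

    agent_positions : list of (x, y) grid squares, one per agent
    tokens          : dict mapping (x, y) -> token value in [0, 1]
    observed        : set of token locations already seen this episode (mutated)
    """
    reward = 0.0
    for pos in agent_positions:
        if pos in tokens and pos not in observed:
            reward += tokens[pos]
            observed.add(pos)        # later visits earn nothing this episode
    return reward

tokens = {(0, 0): 0.9, (2, 1): 0.4}   # hypothetical token layout
observed = set()
r1 = token_reward([(0, 0), (0, 0)], tokens, observed)  # two agents, one token
r2 = token_reward([(0, 0), (2, 1)], tokens, observed)  # (0, 0) already seen
```

The shared `observed` set is what couples the agents: one agent collecting a token removes that reward for every other agent for the rest of the episode.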

Learning Algorithms 

As in the traffic congestion problem, we test the perfor- 
mance of the following four learning methods: 

• Standard Q-learning is based on the full reward $r_t$:

$$Q(s_t, a_t) = r_t(s_t) + \max_a Q(s_{t+1}, a). \tag{12}$$

• Local Q-learning is only a function of the specific agent’s
own state:

$$Q_{loc}(s_t, a_t) = \sum_j V_j \, I^t_{s_{t,i} = L_j} + \max_a Q_{loc}(s_{t+1}, a).$$

• QUICR-learning instead updates with a reward that is a
function of all of the states, but uses counterfactuals to
suppress the effect of other agents’ actions:

$$Q_{UICR}(s_t, a_t) = r_t(s_t) - r_t(s_t - s_{t,i} + s_b) + \max_a Q_{UICR}(s_{t+1}, a),$$

where $s_t - s_{t,i} + s_b$ is the state resulting from removing
agent $i$’s state and replacing it with the absorbing state $s_b$.

• $WLU_t$ Q-learning is similar to QUICR-learning, but uses
a different counterfactual state. Instead of replacing the
state $s_{t,i}$ by the absorbing state $s_b$, it is replaced by the
state that the agent would have been in if it had taken
action 0, which causes the agent to move to the right:

$$Q_{WLU}(s_t, a_t) = r_t(s_t) - r_t(s_t - s_{t,i} + s^{right}_{t-1,i}) + \max_a Q_{WLU}(s_{t+1}, a),$$

where $s^{right}_{t-1,i}$ is the state to the right of $s_{t-1,i}$.

Results 

In this experiment we use a token distribution where the
“highly valued” tokens are concentrated in one corner, with
a second concentration near the center where the rovers are
initially located. This experiment is described in more detail
in (Turner, Agogino, & Wolpert 2002).



Figure 2: Multi-Agent Grid World Problem (40 Agents).





Figure 2 shows the performance for 40 agents on a 400
unit-square world for episodes of 20 time steps (error bars
of ± one $\sigma$ are included). The performance measure in
this figure is the sum of full rewards ($r_t(s_t)$) received in
an episode, normalized so that the maximum reward achievable
is 1.0. Note that all learning methods are evaluated on the
same reward function, independent of the reward function
that they are internally using to assign credit to the agents.

The results show that local Q-learning generally produces
poor results. This problem is caused by all agents aiming
to acquire the most valuable tokens, and congregating
towards the corner of the world where such tokens are
located. In essence, in this case agents using local Q-learning
compete, rather than cooperate. The agents using standard
Q-learning do not fare better, as the agents are plagued by
the credit assignment problem associated with each agent
receiving the full world reward for each individual action they
take. Agents using QUICR-learning, on the other hand, learn
rapidly, outperforming both local and standard Q-learning
by a factor of six (over random rovers). Agents using $WLU_t$
Q-learning eventually achieve high performance, but learn
three times more slowly than agents using QUICR-learning.

Discussion 

Using Q-learning to learn a control policy for a single agent
in a coordination problem with many agents is difficult,
because an agent will often have little influence over the
reward it is trying to maximize. In our examples, an agent’s
reward received after an action could be influenced by as
many as 800 other actions from other time steps and other
agents. Even temporal difference methods that perform well
in single-agent systems will be overwhelmed by the number
of actions influencing a reward in the multi-agent setting.
To address this problem, this paper introduces
QUICR-learning, which aims at reducing the impact of other
agents’ actions without assuming linearly separable reward
functions. Within the Q-learning framework, QUICR-learning
uses the difference reward computed with immediate
counterfactuals. While eliminating much of the influence of
other agents, this reward is shown mathematically to be aligned
with the global reward: agents maximizing the difference
reward will also be maximizing the global reward.
Experimental results in a traffic congestion problem and a grid
world problem confirm the analysis, showing that
QUICR-learning learns in less time than standard Q-learning, and
achieves better results than Q-learning variants that use local
rewards and assume linear separability. While this method
was used with TD(0) Q-learning updates, it also extends to
TD($\lambda$), Sarsa-learning and Monte Carlo estimation.

In our experiments an agent’s state is never directly
influenced by the actions of other agents. Despite this, the
agents are still tightly coupled by virtue of their reward.
Agents can, and did, affect each other’s ability to achieve
high rewards, adding complexity that does not exist in
systems where agents are independent. In addition, even though
agents do not directly influence each other’s states, they
indirectly affect each other through learning: an agent’s
actions can impact another agent’s reward, and the agents
select actions based on previous rewards received. Hence an
agent’s action at time step t does affect other agents at t' > t
through their learning algorithms. The mathematics in this
paper does not address these indirect influences, which are a
subject of further research. However, experimental evidence
shows that agents can still cooperate despite these indirect
effects. In fact, even when agents directly influence each
other’s states, in practice they may still cooperate effectively
as long as they use agent-sensitive rewards that are aligned
with the system reward, as has been shown in experiments
presented in (Agogino & Turner 2004).